# install.packages("tidygraph")
library(tidygraph)
library(tidyverse)
SNA data introduction
Session 4a
1 Network data in Tidyverse
Now that we know how network data are structured, we can turn to the basics of social network analysis using the tidygraph package in R.
tidygraph provides a tidy framework for working with network data, making it easy to manipulate and visualize networks using the interfaces defined in the dplyr and ggplot2 packages. It also provides tidy interfaces to many other established SNA packages in R, such as igraph.
1.1 Creating a graph object
Working with network data normally starts with creating a tbl_graph object. The tbl_graph object provides a structured way to store both node (i.e., actor) and edge (i.e., tie) data in a single object.
It consists of two data frames: one for nodes and another for edges. This structured representation makes it easier to work with graph data and ensures that the data is organized and consistent.
Let’s start by inspecting a toy network dataset and converting it into a tbl_graph object.
You can find two data frames of this toy data in our class Google Drive “SNA toy data” folder:
“toynet.csv”: the edgelist data containing information about the edges in the network
“toyatt.csv”: the node data containing information about the nodes in the network
toy_nodes <- read.csv("toyatt.csv")
toy_edgelist <- read.csv("toynet.csv")
After reading in the data sets, we convert them into a graph object using the tbl_graph() function.
ex.nw <- tbl_graph(nodes = toy_nodes,
                   edges = toy_edgelist,
                   directed = TRUE) # TRUE if it is a directed network
ex.nw
# A tbl_graph: 6 nodes and 12 edges
#
# A directed simple graph with 1 component
#
# Node Data: 6 × 4 (active)
node attr1 attr2 attr3
<int> <dbl> <dbl> <int>
1 1 2.4 2 1
2 2 2.6 2 1
3 3 1.1 1 1
4 4 -0.5 -0.5 0
5 5 -3 -2 0
6 6 -1 0.5 0
#
# Edge Data: 12 × 3
from to weight
<int> <int> <int>
1 1 2 1
2 1 3 1
3 2 1 1
# ℹ 9 more rows
class(ex.nw)
[1] "tbl_graph" "igraph"
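If the class CSV files are not at hand, the same kind of object can be built from data frames typed inline. The following is a minimal sketch with made-up data; the names demo_nodes, demo_edges, and demo.nw are hypothetical and are not part of the toy data set.

```r
library(tidygraph)

# Hypothetical toy data: three actors and three directed ties
demo_nodes <- data.frame(node = 1:3, attr1 = c(0.5, -1.2, 2.0))
demo_edges <- data.frame(from = c(1L, 2L, 3L),
                         to = c(2L, 3L, 1L),
                         weight = c(1L, 1L, 1L))

# Integer `from`/`to` values refer to row positions in the node data
demo.nw <- tbl_graph(nodes = demo_nodes,
                     edges = demo_edges,
                     directed = TRUE)

demo.nw        # prints the node and edge tables together
class(demo.nw) # the object is both a tbl_graph and an igraph object
```

Because the object also inherits from igraph, igraph functions such as igraph::gorder() (number of nodes) and igraph::gsize() (number of edges) can be called on it directly.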
1.2 Nodes
We can extract the node and edge data from this graph object.
The node data store all the relevant information about the nodes, and function similarly to the metadata associated with documents in text-as-data analyses.
as.list(ex.nw)$nodes
# A tibble: 6 × 4
node attr1 attr2 attr3
<int> <dbl> <dbl> <int>
1 1 2.4 2 1
2 2 2.6 2 1
3 3 1.1 1 1
4 4 -0.5 -0.5 0
5 5 -3 -2 0
6 6 -1 0.5 0
1.3 Edgelist
The edge data is the edgelist. Its “from” and “to” columns represent the edges (i.e., ties) between nodes (i.e., actors); in this toy data a third column, “weight”, is stored alongside them.
as.list(ex.nw)$edges
# A tibble: 12 × 3
from to weight
<int> <int> <int>
1 1 2 1
2 1 3 1
3 2 1 1
4 2 3 1
5 3 2 1
6 3 4 1
7 3 6 1
8 4 5 1
9 4 6 1
10 5 4 1
11 6 3 1
12 6 4 1
2 A case study: Harry Potter peer support networks
In what follows, we walk through a case study of the peer-support networks in the magical world of Harry Potter. The data were made available by Goele Bossaert and Nadine Meidert (see here). A peer-support tie is voluntary emotional, instrumental, or informational support, or praise, given by one living adolescent character to another within the book’s pages. Characters’ attributes are also included: name, school year, gender, and the house assigned by the Sorting Hat.
# install.packages("manynet")
library(manynet)
data(fict_potter) # In older versions of `manynet`, the HP data is called ison_potter
Let’s look at the basic information about this network. The object carries three classes (i.e., types of network objects), so it can be used directly by functions from manynet, tidygraph, and igraph. Throughout this class, we will focus on the tidygraph way.
class(fict_potter)
[1] "mnet" "tbl_graph" "igraph"
fict_potter
── # Harry Potter support network ──────────────────────────────────────────────
# A longitudinal, labelled, complex, directed network of 64 students and 544
support arcs over 6 waves
── Nodes
# A tibble: 64 × 5
name schoolyear gender house active
<chr> <int> <chr> <chr> <logi>
1 Adrian Pucey 1989 male Slytherin TRUE
2 Alicia Spinnet 1989 female Gryffindor TRUE
3 Angelina Johnson 1989 female Gryffindor TRUE
4 Anthony Goldstein 1991 male Ravenclaw TRUE
# ℹ 60 more rows
── Changes
# A tibble: 81 × 4
time node var value
<int> <int> <chr> <lgl>
1 2 9 active TRUE
2 2 21 active TRUE
3 2 35 active TRUE
4 2 39 active FALSE
# ℹ 77 more rows
── Ties
# A tibble: 544 × 3
from to wave
<int> <int> <dbl>
1 11 11 1
2 11 25 1
3 11 26 1
4 11 44 1
# ℹ 540 more rows
How many support relationships exist in each book (defined by the “wave” variable)?
# tie distribution across the books
fict_potter %>%
  activate(edges) %>%
  as_tibble() %>% # convert the edgelist to a data frame first; the functions below require a rectangular data structure
  group_by(wave) %>%
  summarize(support.tie.count = n())
# A tibble: 6 × 2
wave support.tie.count
<dbl> <int>
1 1 47
2 2 110
3 3 104
4 4 49
5 5 160
6 6 74
In the following demonstration, we will use the support network data from the sixth Harry Potter book (“Harry Potter and the Half-Blood Prince”) and name it hp.6.
As shown, it is a directed network with 64 actors and 74 ties.
hp.6 <- fict_potter %>%
  activate(edges) %>%
  filter(wave == 6) # `filter()` can be applied directly to the edgelist. Similar functions include `arrange()` and `mutate()`.
hp.6
── # Harry Potter support network ──────────────────────────────────────────────
# A longitudinal, labelled, complex, directed network of 64 students and 74
support arcs over 6 waves
── Nodes
# A tibble: 64 × 5
name schoolyear gender house active
<chr> <int> <chr> <chr> <logi>
1 Adrian Pucey 1989 male Slytherin TRUE
2 Alicia Spinnet 1989 female Gryffindor TRUE
3 Angelina Johnson 1989 female Gryffindor TRUE
4 Anthony Goldstein 1991 male Ravenclaw TRUE
# ℹ 60 more rows
── Changes
# A tibble: 81 × 4
time node var value
<int> <int> <chr> <lgl>
1 2 9 active TRUE
2 2 21 active TRUE
3 2 35 active TRUE
4 2 39 active FALSE
# ℹ 77 more rows
── Ties
# A tibble: 74 × 3
from to wave
<int> <int> <dbl>
1 11 11 6
2 11 25 6
3 11 56 6
4 11 58 6
# ℹ 70 more rows
2.1 Understanding the data
As with any analysis, we want to start by understanding our data. In SNA projects, we can manipulate the edge data, calculate the structural characteristics of the network (e.g., centrality measures of nodes), and learn the attributes of the actors.
Before doing these inspections, we need to use the pointer function activate() to tell R which data, nodes or edges, to work on.
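To make the role of activate() concrete, here is a minimal sketch on a made-up three-node graph (demo.nw and its contents are hypothetical, not part of the Harry Potter data). The same graph object returns a different table depending on which component is active.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph: three named actors, three directed ties
demo.nw <- tbl_graph(
  nodes = tibble(name = c("A", "B", "C")),
  edges = tibble(from = c(1L, 1L, 2L), to = c(2L, 3L, 3L)),
  directed = TRUE
)

active(demo.nw) # reports which table is currently active

# Point at the nodes, then pull them out as a tibble
node_tbl <- demo.nw %>% activate(nodes) %>% as_tibble()

# Point at the edges instead; the same call now returns the edgelist
edge_tbl <- demo.nw %>% activate(edges) %>% as_tibble()

node_tbl
edge_tbl
```

Any verb that follows activate() in a pipe, such as filter() or mutate(), operates on the active table only, while the other table stays attached to the graph.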
2.2 Manipulating edges data
Here is another way, the tidy way, to extract the edgelist data.
hp.6_edgelist <- hp.6 %>%
  activate(edges) %>%
  as_tibble()
# Alternatively: as.list(hp.6)$edges
From a quick look at the edgelist, you might have noticed that self-nomination ties (self-loops) are included. A person can surely help themselves, but in some cases we don’t want self-nominations.
How do we remove these self-nomination ties in R? This is equivalent to a data manipulation task we learned at the beginning of this class. Here we use the filter() function again.
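Before applying the filter to the Harry Potter data, its effect can be checked on a small made-up graph where the self-loops are easy to count by eye. This is only a sketch; demo.nw and the counts below refer to the hypothetical toy graph.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph with two self-loops (A -> A and C -> C)
demo.nw <- tbl_graph(
  nodes = tibble(name = c("A", "B", "C")),
  edges = tibble(from = c(1L, 1L, 2L, 3L), to = c(1L, 2L, 3L, 3L)),
  directed = TRUE
)

# Keep only the edges whose sender and receiver differ
demo_no_self <- demo.nw %>%
  activate(edges) %>%
  filter(from != to)

# Compare the number of edges before and after filtering
n_before <- demo.nw %>% activate(edges) %>% as_tibble() %>% nrow()
n_after <- demo_no_self %>% activate(edges) %>% as_tibble() %>% nrow()
c(before = n_before, after = n_after)
```

Filtering edges leaves the node table untouched, so actors whose only tie was a self-loop remain in the network as isolates.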
hp.6_no_self <- hp.6 %>%
  activate(edges) %>%
  filter(from != to) # exclude the edges where the "from" and "to" columns have the same value
2.3 The importance of nodes
Centrality measures quantify the importance or influence of nodes. Let’s calculate the out-degree centrality of the characters and find the five most helpful characters.
top5_offer_help <- hp.6_no_self %>%
  activate(nodes) %>%
  mutate(out_degree = centrality_degree(mode = "out")) %>%
  top_n(5, out_degree) %>% # selects the top 5 nodes, keeping ties (i.e., nodes with the same out-degree)
  select(name, out_degree) %>%
  arrange(desc(out_degree)) %>%
  as_tibble()
top5_offer_help
# A tibble: 9 × 2
name out_degree
<chr> <dbl>
1 Harry James Potter 10
2 Ronald Weasley 8
3 Hermione Granger 6
4 Ginny Weasley 5
5 Dean Thomas 3
6 Fred Weasley 3
7 Luna Lovegood 3
8 Neville Longbottom 3
9 Seamus Finnigan 3
Unsurprisingly, the Trio are the most active helpers.
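The table above has nine rows rather than five because top_n() keeps every row tied with the fifth-largest value, and five characters share an out-degree of 3. In current dplyr, top_n() is superseded by slice_max(), which makes the handling of ties explicit. A minimal sketch on a made-up tibble (demo_degree is hypothetical):

```r
library(dplyr)

# Hypothetical out-degree scores with a three-way tie at the cutoff
demo_degree <- tibble(
  name = c("a", "b", "c", "d", "e", "f", "g", "h"),
  out_degree = c(10, 8, 6, 5, 3, 3, 3, 1)
)

# Keep ties: every row tied with the fifth-largest value is returned
with_ties <- demo_degree %>%
  slice_max(out_degree, n = 5, with_ties = TRUE)

# Drop ties: exactly five rows are returned
without_ties <- demo_degree %>%
  slice_max(out_degree, n = 5, with_ties = FALSE)

nrow(with_ties)
nrow(without_ties)
```

Here the tie-keeping version returns seven rows and the strict version returns five. Which behaviour is appropriate depends on whether an arbitrary cut among equally central actors is acceptable for the analysis.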
2.3.1 Inspecting nodes’ ties
Wait, who are these nodes? Let’s see the HP characters.
hp.6 %>% activate(nodes) %>% as_tibble() %>% pull(name)
[1] "Adrian Pucey" "Alicia Spinnet" "Angelina Johnson"
[4] "Anthony Goldstein" "Blaise Zabini" "C. Warrington"
[7] "Cedric Diggory" "Cho Chang" "Colin Creevey"
[10] "Cormac McLaggen" "Dean Thomas" "Demelza Robins"
[13] "Dennis Creevey" "Draco Malfoy" "Eddie Carmichael"
[16] "Eleanor Branstone" "Ernie Macmillan" "Euan Abercrombie"
[19] "Fred Weasley" "George Weasley" "Ginny Weasley"
[22] "Graham Pritchard" "Gregory Goyle" "Hannah Abbott"
[25] "Harry James Potter" "Hermione Granger" "Jimmy Peakes"
[28] "Justin Finch-Fletchley" "Katie Bell" "Kevin Whitby"
[31] "Lavender Brown" "Leanne" "Lee Jordan"
[34] "Lucian Bole" "Luna Lovegood" "Malcolm Baddock"
[37] "Mandy Brocklehurst" "Marcus Belby" "Marcus Flint"
[40] "Michael Corner" "Miles Bletchley" "Millicent Bulstrode"
[43] "Natalie McDonald" "Neville Longbottom" "Oliver Wood"
[46] "Orla Quirke" "Owen Cauldwell" "Padma Patil"
[49] "Pansy Parkinson" "Parvati Patil" "Penelope Clearwater"
[52] "Percy Weasley" "Peregrine Derrick" "Roger Davies"
[55] "Romilda Vane" "Ronald Weasley" "Rose Zeller"
[58] "Seamus Finnigan" "Stewart Ackerley" "Susan Bones"
[61] "Terry Boot" "Theodore Nott" "Vincent Crabbe"
[64] "Zacharias Smith"
We can find the ties of a node using the node’s name. To do so, we use the join family of functions to link the node and edge data.
# Assign IDs to nodes
hp.6_with_id <- hp.6 %>%
  activate(nodes) %>%
  mutate(id = row_number()) # mutate() can be applied directly to the active node data in tidygraph
# Activate edges and join with the node data to get the "name" variable
hp.6_edges_with_names.df <- hp.6_with_id %>%
  activate(edges) %>%
  as_tibble() %>% # again, convert the edgelist to a data frame before running the following functions
  left_join(hp.6_with_id %>% activate(nodes),
            by = c("from" = "id"), copy = TRUE) %>% # attach the names of the support senders
  rename(from_name = name) %>%
  left_join(hp.6_with_id %>% activate(nodes),
            by = c("to" = "id"), copy = TRUE) %>% # attach the names of the support receivers
  rename(to_name = name) %>%
  select(from:from_name, to_name)
Now we can check the ties of a specific node. Let’s see the help offered or received by Harry Potter.
hp.6_edges_with_names.df %>%
  filter(from_name == "Harry James Potter" | to_name == "Harry James Potter") %>%
  select(from_name, to_name)
# A tibble: 20 × 2
from_name to_name
<chr> <chr>
1 Dean Thomas Harry James Potter
2 Fred Weasley Harry James Potter
3 George Weasley Harry James Potter
4 Ginny Weasley Harry James Potter
5 Harry James Potter Demelza Robins
6 Harry James Potter Fred Weasley
7 Harry James Potter George Weasley
8 Harry James Potter Ginny Weasley
9 Harry James Potter Harry James Potter
10 Harry James Potter Hermione Granger
11 Harry James Potter Katie Bell
12 Harry James Potter Leanne
13 Harry James Potter Luna Lovegood
14 Harry James Potter Neville Longbottom
15 Harry James Potter Ronald Weasley
16 Hermione Granger Harry James Potter
17 Luna Lovegood Harry James Potter
18 Neville Longbottom Harry James Potter
19 Ronald Weasley Harry James Potter
20 Seamus Finnigan Harry James Potter
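As an alternative to the two joins above, tidygraph offers the .N() accessor, which exposes the node table while the edges are active. The sketch below shows the idea on a made-up three-node graph (demo.nw is hypothetical); it relies on the fact that the from and to columns hold row positions in the node table.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph: A -> B, B -> C, C -> A
demo.nw <- tbl_graph(
  nodes = tibble(name = c("A", "B", "C")),
  edges = tibble(from = c(1L, 2L, 3L), to = c(2L, 3L, 1L)),
  directed = TRUE
)

# With the edges active, look up sender and receiver names by row position
demo_edges_named <- demo.nw %>%
  activate(edges) %>%
  mutate(from_name = .N()$name[from],
         to_name = .N()$name[to]) %>%
  as_tibble()

# Ties sent or received by a given actor
demo_edges_named %>%
  filter(from_name == "A" | to_name == "A")
```

This avoids creating an explicit id column, at the cost of being specific to tidygraph rather than plain dplyr.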
Can you find who received the most help? We need to calculate the nodes’ in-degree centrality, which indicates the characters’ popularity, or the extent to which they receive support from others in the network.
top5_offer_received <- hp.6_no_self %>%
  activate(nodes) %>%
  mutate(in_degree = centrality_degree(mode = "in")) %>%
  top_n(5, in_degree) %>%
  select(name, in_degree) %>%
  arrange(desc(in_degree)) %>%
  as_tibble()
top5_offer_received
What else can you learn from the actors’ centrality measures? For example, an actor with high out-degree centrality but low in-degree centrality may be a key provider of help but may not receive much support in return. On the other hand, an actor with high in-degree centrality but low out-degree centrality may be a frequent recipient of help but may not actively offer assistance to others.
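One way to explore this contrast is to put both degree measures side by side and take their difference. The following is a sketch on a made-up four-node graph; demo.nw and the column net_giving are hypothetical names introduced only for illustration.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph: A helps B, C, and D; only B helps A back
demo.nw <- tbl_graph(
  nodes = tibble(name = c("A", "B", "C", "D")),
  edges = tibble(from = c(1L, 1L, 1L, 2L), to = c(2L, 3L, 4L, 1L)),
  directed = TRUE
)

demo_degree <- demo.nw %>%
  activate(nodes) %>%
  mutate(out_degree = centrality_degree(mode = "out"),
         in_degree = centrality_degree(mode = "in")) %>%
  as_tibble() %>%
  # positive values: gives more help than received; negative: the reverse
  mutate(net_giving = out_degree - in_degree) %>%
  arrange(desc(net_giving))

demo_degree
```

In this toy graph, A sends three ties and receives one, so A sits at the top as a net provider, while C and D only receive help.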
Can you calculate the betweenness centrality and find the characters who are important in facilitating the flow of help or resources through the network?
top5_bridge <- hp.6_no_self %>%
  activate(nodes) %>%
  mutate(betweenness = centrality_betweenness(directed = TRUE)) %>%
  top_n(5, betweenness) %>% # rank by the betweenness column created above
  select(name, betweenness)
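To build intuition for what betweenness measures, consider a made-up chain of three actors in which help flows A -> B -> C (demo.nw is a hypothetical toy graph). The only shortest path between two other actors, from A to C, passes through B, so B is the only node with non-zero betweenness.

```r
library(tidygraph)
library(dplyr)

# Hypothetical directed chain: A -> B -> C
demo.nw <- tbl_graph(
  nodes = tibble(name = c("A", "B", "C")),
  edges = tibble(from = c(1L, 2L), to = c(2L, 3L)),
  directed = TRUE
)

demo_between <- demo.nw %>%
  activate(nodes) %>%
  mutate(betweenness = centrality_betweenness(directed = TRUE)) %>%
  as_tibble()

demo_between
```

In the unnormalized form used here, the score counts the shortest paths between other pairs of nodes that run through a node, so B scores 1 and the two endpoints score 0.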
2.4 The attributes of nodes
While tidygraph provides its own versions of data manipulation functions such as mutate() and filter(), there are situations where we want to use other functions from the dplyr package, such as group_by(), summarise(), and others. However, these dplyr functions are designed to work with data frames or tibbles, not directly with graph objects.
To bridge this compatibility gap, we use the as_tibble() function to convert the nodes or edges of a graph object into a tibble format. By applying as_tibble() after activating the nodes or edges with activate(), we create a tabular data structure that is compatible with dplyr functions.
See the example below, in which we look at the distribution of characters across houses.
hp.6_no_self %>%
  activate(nodes) %>%
  as_tibble() %>%
  group_by(house) %>%
  summarise(n = n()) %>%
  mutate(proportion = round(n / sum(n), 2))
# A tibble: 4 × 3
house n proportion
<chr> <int> <dbl>
1 Gryffindor 25 0.39
2 Hufflepuff 11 0.17
3 Ravenclaw 13 0.2
4 Slytherin 15 0.23
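Once the nodes are a tibble, any dplyr shortcut is available. For instance, count() is shorthand for group_by() followed by summarise(n = n()). A minimal sketch on a made-up graph (demo.nw and its house values are hypothetical):

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy graph with a categorical node attribute
demo.nw <- tbl_graph(
  nodes = tibble(
    name = c("A", "B", "C", "D"),
    house = c("Gryffindor", "Gryffindor", "Slytherin", "Ravenclaw")
  ),
  edges = tibble(from = c(1L, 2L), to = c(2L, 3L)),
  directed = TRUE
)

demo_house <- demo.nw %>%
  activate(nodes) %>%
  as_tibble() %>%                      # leave the graph, keep the node table
  count(house) %>%                     # one row per house with its frequency n
  mutate(proportion = round(n / sum(n), 2))

demo_house
```

Note that after as_tibble() the result is an ordinary table; the edges are no longer attached, so this step belongs at the end of a graph pipeline.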